Succinct and Informative Cluster Descriptions for Document Repositories

Authors

Lijun Chen

Guozhu Dong

Abstract

Large document repositories need to be organized, summarized, and labeled in order to be used effectively. Previous clustering studies focused on organization and paid little attention to producing cluster labels. Without informative labels, users must browse many documents to get a sense of what the clusters contain. Human labeling of clusters is not viable when clustering is performed on demand or for very few users. It is therefore desirable to generate informative cluster descriptions (CDs) automatically, both to give users a high-level sense of the clusters and to help repository managers produce the final cluster labels. This paper studies CDs in the form of small term sets for document clusters, and investigates how to measure the quality, or fidelity, of CDs and how to construct high-quality CDs. We propose to use a CD-based classification to simulate how CDs are interpreted, and to use the F-score of that classification to measure CD quality. Since searching directly for good CDs using the F-score is too expensive, we consider a surrogate quality measure, the CDD measure, which combines three factors: coverage, disjointness, and diversity. We give a search strategy for constructing CDs, namely a layer-based replacement method called PagodaCD. Experimental results show that the algorithm is efficient and can produce high-quality CDs. CDs produced by PagodaCD also exhibit a monotone quality behavior.
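
As an illustration of the evaluation idea in the abstract, the following is a minimal sketch of a CD-based classification scored by F-score. It is not the paper's procedure: the overlap-count assignment rule, the macro-averaging of per-cluster F-scores, the tie-breaking rule, and all names (`classify`, `macro_f_score`, the toy clusters and terms) are assumptions made here for illustration only.

```python
from typing import Dict, List, Set


def classify(doc_terms: Set[str], cds: Dict[str, Set[str]]) -> str:
    """Assign a document to the cluster whose CD shares the most terms with it.

    Assumption: plain overlap counting; ties go to the alphabetically
    first cluster name. The paper may use a different rule.
    """
    return max(sorted(cds), key=lambda c: len(doc_terms & cds[c]))


def macro_f_score(docs: List[Set[str]], labels: List[str],
                  cds: Dict[str, Set[str]]) -> float:
    """Average the per-cluster F-scores of the CD-based classification."""
    predicted = [classify(d, cds) for d in docs]
    scores = []
    for c in cds:
        pairs = list(zip(predicted, labels))
        tp = sum(1 for p, t in pairs if p == c and t == c)
        fp = sum(1 for p, t in pairs if p == c and t != c)
        fn = sum(1 for p, t in pairs if p != c and t == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): each document is a set of terms with a true cluster.
docs = [{"goal", "match", "team"}, {"league", "goal"},
        {"vote", "election"}, {"senate", "vote", "election"}]
labels = ["sports", "sports", "politics", "politics"]

good_cds = {"sports": {"goal", "team"}, "politics": {"vote", "election"}}
bad_cds = {"sports": {"vote"}, "politics": {"goal"}}

good_score = macro_f_score(docs, labels, good_cds)
bad_score = macro_f_score(docs, labels, bad_cds)
```

Under these assumptions, a CD whose terms point documents back to their own cluster scores higher than one whose terms point elsewhere, which is the sense in which the F-score can serve as a fidelity measure for a description.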

Similar resources

The Saadatabad Garden of Isfahan in the Mirror of the Masnavi Golzar-e Saadat

In the study of historic monuments that are more or less ruined, the reflections and images in written historical manuscripts are of utmost importance. In these cases, any succinct references can be valuable and informative regarding the historical status, architectural configuration, and formal composition of these monuments. One of the less studied resources for architectural histo...

Evaluation of EAP Programs in Iran: Document Analysis and Expert Perspectives

This study aimed to examine the policies in the Iranian English for Academic Purposes (EAP) education and the extent to which objectives match the policies and are materialized in practice. To this end, course descriptions in the syllabi for the EAP programs were evaluated through document analysis and triangulated with the experts’ perspectives through interviews to examine the current status ...

Athena: Text Mining Based Discovery of Scientific Workflows in Disperse Repositories

Scientific workflows are abstractions used to model and execute in silico scientific experiments. They represent key resources for scientists and are enacted and managed by engines called Scientific Workflow Management Systems (SWfMS). Each SWfMS has a particular workflow language. This heterogeneity of languages and formats poses a complex scenario for scientists to search or discover workflo...

Document Classification via Structure Synopses

Information available on the Internet is frequently supplied simply as plain ASCII text, structured according to orthographic and semantic conventions. Traditional document classification is typically formulated as a learning problem where each instance is a whole document that is represented by a feature vector. Such feature vectors are often generated based on the appearance and frequencies o...

Concept-based Text Clustering

Thematic organization of text is a natural practice of humans and a crucial task for today’s vast repositories. Clustering automates this by assessing the similarity between texts and organizing them accordingly, grouping like ones together and separating those with different topics. Clusters provide a comprehensive logical structure that facilitates exploration, search and interpretation of cu...

Publication date: 2006